Scaled Log Likelihood Ratios for the Detection of Abbreviations in Text Corpora

نویسندگان

  • Tibor Kiss
  • Jan Strunk
چکیده

We describe a language-independent, flexible, and accurate method for the detection of abbreviations in text corpora. It is based on the idea that an abbreviation can be viewed as a collocation, and can be identified by using methods for collocation detection such as the log likelihood ratio. Although the log likelihood ratio is known to show a good recall, its precision is poor. We employ scaling factors which lead to a strong improvement of precision. Experiments with English and German corpora show that abbreviations can be detected with high accuracy.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Viewing sentence boundary detection as collocation identification

The detection of abbreviations is an important step in the process of sentence boundary detection. We describe a flexible, languageindependent and accurate method based on the idea that an abbreviation can be viewed as a collocation. As such, it can be identified by using methods for collocation detection such as the log likelihood ratio. Although the log likelihood ratio is known to show a goo...

متن کامل

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

متن کامل

Hedges in English for Academic Purposes: A Corpus-based study of Iranian EFL learners

Hedges, as tools to express tentativeness and doubt, have been studied in plenty of research papers in the Iranian EFL research setting. However, their use in a learner corpus, portraying Iranian learner English, is in need of more research attention. With this end in view, this study aimed at investigating how Iranian EFL learners who have majored in English-related fields in Iran deployed hed...

متن کامل

Extending the Cochran rule for the comparison of word frequencies between corpora

We first describe a number of inter-related issues that need to be considered by the researcher when comparing frequencies of linguistic features in two or more corpora. We then describe the chi-squared and log-likelihood tests used in previous research for the comparison of word frequencies. Our focus, in this paper, is on the issue of reliability of the statistical tests, and we describe simu...

متن کامل

Asymptotic normality and analysis of variance of log-likelihood ratios in spiked random matrix models

The present manuscript studies signal detection by likelihood ratio tests in a number of spiked random matrix models, including but not limited to Gaussian mixtures and spiked Wishart covariance matrices. We work directly with multi-spiked cases in these models and with flexible priors on the signal component that allow dependence across spikes. We derive asymptotic normality for the log-likeli...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2002